Abstract. We extend the DSV method of computing the growth series of an 
unambiguous context-free language to the larger class of indexed languages. 
We illustrate the technique with numerous examples. 



1. Introduction 

1.1. Indexed grammars. Indexed grammars were introduced in the thesis of 
Aho in the late 1960s to model a more natural subclass of context-sensitive lan- 
guages with interesting closure properties [1] [M] , but which is more expressive than 
context-free grammars. The original reference for basic results on indexed gram- 
mars is [1] . The complete definition of these grammars is equivalent to the following 
reduced form. 

Definition. A reduced indexed grammar is a 5-tuple \((V, \Sigma, I, P, S)\), where \(V\) is a
set of variables (also referred to as non-terminals), \(\Sigma\) the set of terminals, \(I\) the set
of indices, \(S \in V\) the start symbol, and \(P\) a finite set of productions of the form

(1.1) \(A \to \alpha \qquad A \to B_f \qquad A_f \to \alpha\)

where \(A, B \in V\), \(f \in I\) and \(\alpha \in (V \cup \Sigma)^*\). Derivations are similar to those
in a context-free grammar, except that each variable is associated to a dedicated
stack, which we indicate here with a subscript. The expressive power of an indexed
grammar comes from the treatment of the stack. For example, in productions of
the form \(A \to BC\) the stack of \(A\) is copied to both \(B\) and \(C\). Thus, each rule
is essentially a shorthand for an infinite family of derivation rules. The other two
rules are used to push onto and pop from the stack, respectively. We write variables in
bold upper case, terminals in lower case, and indices by subscript. It is convenient
to have a special symbol to mark the bottom of the index stack (corresponding to
the rightmost subscript in an index string). We use $ for this purpose.

This class of languages properly includes all context-free languages: these are
generated by grammars whose production rules carry no indices. Furthermore, it is a
proper subset of the class of context-sensitive languages (for instance
\(\{(ab^n)^n : n > 0\}\) is context-sensitive but not indexed [ID]).

As alluded to above, this is a full abstract family of languages which is closed
under union, concatenation, Kleene closure, homomorphism, inverse homomorphism
and intersection with regular sets. The set of indexed languages, however, is not
closed under intersection or complement.

This class of languages is strictly larger than the class of context-free languages
since it contains the language \(\{a^n b^n c^n : n > 0\}\). A corresponding indexed grammar
is given by

\(S \to aA_{\$}c \qquad A \to aA_fc \qquad A \to B \qquad B_{\$} \to b \qquad B_f \to bB.\)

We remark that for a typical derivation there is an initial pushing/loading stage
to build up the stack, followed by a transfer stage, and finally a popping/unloading
stage to convert the indices into terminal symbols. The set of all languages
generated by indexed grammars forms the set of indexed languages. The standard
machine type that accepts the class of indexed languages is the nested stack
automaton.
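The three stages can be traced mechanically for the grammar above. The following Python sketch is only an illustration: the function name `derive` and the list used as a stack are our own assumptions, and the code follows one fixed derivation order rather than implementing a general indexed-grammar parser.

```python
def derive(pushes):
    """Trace S -> aA_$c, then `pushes` uses of A -> aA_f c, then A -> B, then popping."""
    left, stack, right = "a", ["$"], "c"      # S -> a A_$ c
    for _ in range(pushes):                    # loading stage: A -> a A_f c
        left, right = left + "a", "c" + right
        stack.append("f")
    middle = ""                                # transfer stage: A -> B keeps the stack
    while stack:                               # unloading stage
        stack.pop()                            # B_f -> bB, and finally B_$ -> b
        middle += "b"
    return left + middle + right


if __name__ == "__main__":
    for k in range(4):
        print(derive(k))
```

Each push contributes one `a` and one `c` immediately and one `b` later, during unloading, which is why the three blocks always have equal length.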

Formal language theory in general and indexed languages in particular have ap- 
plications to group theory. Two good survey articles are [22] and [H] . Bridson and 
Gilman |? have exhibited indexed grammar combings for fundamental 3-manifold 
groups based on Nil and Sol geometries (see Example [Til below). More recently it was
shown that the language of words in the standard generating set of the Grigorchuk 
group that do not represent the identity (the so-called co-word problem) forms an 
indexed language. The original DSV method (attributed to Delest, Schützenberger, 
and Viennot [5]) of computing the growth of a context-free language was success- 
fully exploited [S^ to compute the algebraic but non-rational growth series of a 
family of groups attributed to Higman. One of our goals is to extend this method 
to indexed grammars to deduce results on growth series. 

1.2. Ordinary generating functions. Generating functions are a useful tool to
treat enumerative questions about languages over finite alphabets, in particular
the number of words of a given length. They permit a natural translation between
combinatorial rules and functional equations. For any language \(\mathcal{L}\), let \(L_n\) be the
number of words of length \(n\). The ordinary generating function of the language
is the formal power series \(L(z) = \sum_{n \ge 0} L_n z^n\). We use this terminology
interchangeably with growth series. This is a natural object to study in this context
because of parallels of structure: regular languages have rational generating
functions of the form \(L(z) = P(z)/Q(z)\) where \(P\) and \(Q\) are polynomials with integer
coefficients. Unambiguous context-free languages have algebraic generating
functions, which satisfy \(P(L(z), z) = 0\) for some bivariate polynomial \(P(x, y)\) with
integer coefficients. The complexity of the series thus grows in parallel with that of
the language class. One natural question to ask is the nature of the ordinary
generating functions of languages derived from indexed grammars. Two natural
candidates are D-finite and differentiably algebraic. A formal series in \(\mathbb{C}[[z]]\) is
said to be D-finite if it satisfies a homogeneous linear differential equation with
polynomial coefficients. This is equivalent to a series with P-recursive coefficients.
These series appear frequently as generating functions of structured combinatorial
objects, although no direct interpretation is known. The class strictly contains the
set of algebraic series. Most of our examples here are not D-finite because they
have an infinite number of singularities, a property inconsistent with D-finiteness.
Indeed, many of our examples are lacunary series, which have a natural boundary at
the unit circle. For more on the parallel, and on D-finite functions, consult
[8, Appendix B.4]. A series \(L(z)\) is said to be differentiably algebraic if there is a
\((k+1)\)-variate polynomial \(P(x_0, x_1, \dots, x_k)\) with integer coefficients such that
\(P\left(L(z), \frac{d}{dz}L(z), \frac{d^2}{dz^2}L(z), \dots, \frac{d^k}{dz^k}L(z)\right) = 0\).
Although this is a very wide class, at least one of our examples does not lie in this
category.

1.3. Summary. Ideally, we would like to describe an efficient algorithm to
determine the generating function of a given indexed grammar. We do describe a
process in the next section that works under some conditions including only one
stack symbol. Lemma 2 summarizes the conditions, and the results. We have
several examples to illustrate the procedure. In Section 3 this is generalized to multiple
stack symbols that are pushed in order. In that section we also illustrate the
inherent obstacles in the case of multiple stack symbols. This is followed by some
further examples from number theory in Section 4 and a discussion in Section 5 on
inherent ambiguity in indexed grammars.

1.4. Notation. We use standard terminology with respect to formal language
theory. The expression \(x|y\) denotes "x exclusive-or y". We use epsilon "\(\varepsilon\)" to denote
the empty word. The Kleene star operator applied to x, written \(x^*\), means make
zero or more copies of x. A related notation is \(x^+\), which means make one or more
copies of x. The word reversal of w is indicated by . The length of the string x is
denoted \(|x|\). We print grammar variables in upper case bold letters. Grammar
terminals are in lower case italic. We use the symbol \(\overset{*}{\to}\) to indicate the composition
of two or more grammar productions.

2. The method: one index symbol 

2.1. Generalizing the DSV process. Our starting point is [1]. Let \(\mathcal{G}\) be a
context-free grammar for some non-empty language \(\mathcal{L}\), with start symbol S.
Replace each terminal with the formal variable z and each non-terminal V with a
formal power series \(V(z)\). Translate the grammar symbols \(\to\), \(|\), \(\varepsilon\) into \(=\), \(+\), \(1\),
respectively, with juxtaposition becoming commutative multiplication. Thus the
grammar is transformed into a system of equations. We summarize their main
result as follows.

Theorem. Each formal power series \(V(z) = \sum v_n z^n\) in the above transformation
is an ordinary generating function where \(v_n\) is an integer representing the number
of word productions of length n that can be realized from the non-terminal V. In
particular, if the original context-free grammar is unambiguous, then \(S(z)\) is the
growth series for the language \(\mathcal{L}\), in which case \(S(z)\) is an algebraic function.
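As a small illustration of the context-free DSV translation, consider the grammar S → aSbS | ε for balanced words. This grammar is our own choice of a standard unambiguous example and does not appear in the text above. The translation gives S(z) = z²S(z)² + 1, and the Python sketch below solves this equation by fixed-point iteration on truncated coefficient lists (the name `dsv_dyck` and the truncation degree `N >= 1` are assumptions of the sketch).

```python
def dsv_dyck(N):
    """Coefficients of S(z) up to z^N (N >= 1) for S = 1 + z^2 * S^2."""
    S = [0] * (N + 1)
    for _ in range(N + 1):                  # each pass fixes further low-order coefficients
        sq = [0] * (N + 1)                  # sq = S^2 truncated at degree N
        for i, a in enumerate(S):
            if a:
                for j in range(N + 1 - i):
                    sq[i + j] += a * S[j]
        S = [1, 0] + sq[: N - 1]            # 1 + z^2 * S^2, truncated at degree N
    return S


if __name__ == "__main__":
    print(dsv_dyck(10))
```

The coefficients at even degrees are the Catalan numbers, consistent with the theorem's claim that an unambiguous grammar yields the growth series of its language.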

A context-free grammar has only finitely many non-terminals. In the case of an
indexed grammar, we treat a single variable V as housing recursively many
non-terminals, one for each distinct index string carried by V (although only finitely
many are displayed in parsing any given word). To generalize the DSV procedure
to indexed grammars we apply the same transformation scheme to the grammar
except that each variable with index string becomes a distinct function (for example
\(V_{gfghgf\$}\) becomes \(V_{gfghgf\$}(z)\)).

The transformation recipe produces a system of infinitely many equations in
infinitely many functions. It is not immediately apparent that such a system can
be solved for \(S(z)\); indeed, we do not always obtain satisfying expressions. Let us
illustrate a successful instance of the procedure with a simple example.

Example 1. The language \(\mathcal{L}_{sqr} = \{0^{2^n} : n \ge 0\}\) is generated by an indexed
grammar. We use $ to indicate the bottom-most index symbol. Disregarding this, there
is only one index symbol actually used:

\(S \to T_{\$} \qquad T \to T_f \,|\, D \qquad D_f \to DD \qquad D_{\$} \to 0\)

Observe that indices are loaded onto T then transferred to D which is then
repeatedly doubled (D is a mnemonic for "duplicator"). After all doubling, each instance
of \(D_{\$}\) becomes a zero.

The index strings used are particularly simple, consisting of a number of f's
followed by one $. This allows us to denote the functions \(V_{f^n\$}(z)\) by the simpler
notation \(V_n(z)\) for \(V \in \{T, D\}\) and \(n \ge 0\). The grammar rule \(D_f \to DD\) implies
the functional equality \(D_{n+1}(z) = D_n^2(z)\) and thus \(D_n(z) = D_0^{2^n}(z) = z^{2^n}\).

The system of grammar equations becomes

\[S(z) = T_0(z) = T_1(z) + D_0(z) = T_2 + D_1 + D_0 = \cdots = \sum_{n \ge 0} D_n(z) = \sum_{n \ge 0} z^{2^n}\]

where we observe that the sequence of partial sums converges as a power series.
We refer to this process, summarized in Lemma 2 below, as pushing the \(T_n(z)\) off
to infinity. Note that \(S(z)\) is not an algebraic function. In fact, \(S(z)\) satisfies no
algebraic differential equation [T6) .
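The elimination in Example 1 can be imitated numerically by working with power series truncated at a fixed degree. The sketch below is an illustration under our own conventions (coefficient lists, the helper names `mul` and `growth_sqr`, and a truncation degree `N >= 1`); it accumulates the D_n until they vanish modulo z^(N+1), which is the finite shadow of pushing T_n off to infinity.

```python
def mul(p, q, N):
    """Multiply two coefficient lists, discarding terms of degree above N."""
    r = [0] * (N + 1)
    for i, a in enumerate(p):
        if a:
            for j, b in enumerate(q):
                if i + j > N:
                    break
                r[i + j] += a * b
    return r


def growth_sqr(N):
    """Coefficients of S(z) = sum_n D_n(z) up to z^N, with D_0 = z and D_{n+1} = D_n^2."""
    D = [0, 1] + [0] * (N - 1)     # D_0(z) = z
    S = [0] * (N + 1)
    while any(D):                  # D_n is zero modulo z^(N+1) once 2^n > N
        S = [s + d for s, d in zip(S, D)]
        D = mul(D, D, N)
    return S


if __name__ == "__main__":
    print(growth_sqr(16))
```

The nonzero coefficients sit exactly at the degrees 1, 2, 4, 8, 16, matching the lacunary series in the example.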

2.2. A straightforward case: Balanced indexed grammars. Next, we
describe a condition that allows us to guarantee that this process will result in a
simplified generating function expression for \(S(z)\).

Assume V is a non-terminal in an indexed grammar \(\mathcal{G}\). We say V loads or
pushes indices if \(\mathcal{G}\) contains a production having form \(V \to UV_fW\) where U and
W each denote a (possibly empty) string of terminals and/or non-terminals and f
is an index symbol. We say that f is the index symbol that is loaded or pushed
onto V. Unloading or popping indices means a production having form \(V_f \to U\),
where U is a nonempty string of terminals and/or non-terminals. Call an indexed
grammar reduced if at most one index symbol is loaded or unloaded for any given
production and there are no useless non-terminals (every non-terminal V satisfies
both \(S \overset{*}{\to} UV_\sigma W\) and \(V_\sigma \overset{*}{\to} w\) for some index string \(\sigma\) and some terminal
string w). A grammar is \(\varepsilon\)-free if the only production involving the empty string \(\varepsilon\)
is \(S \to \varepsilon\). Indexed grammars \(\mathcal{G}_1\) and \(\mathcal{G}_2\) are equivalent if they produce the same
language \(\mathcal{L}\).

Theorem ([18]). Every indexed grammar \(\mathcal{G}\) is equivalent to some reduced, \(\varepsilon\)-free
grammar \(\mathcal{G}'\). Furthermore, there is an effective algorithm to convert \(\mathcal{G}\) to \(\mathcal{G}'\).

Consequently we will assume all grammars are already reduced (most of our
examples are). We have found that \(\varepsilon\)-productions are a useful crutch in
designing grammars (several of our examples are not \(\varepsilon\)-free). Our methods require an
additional hypothesis.

Definition. An indexed grammar is balanced provided there are constants \(C, K \ge 0\),
depending only on the grammar, such that the longest string of indices associated
to any non-terminal in any sentential form W has length at most \(C|w| + K\) where
w is any terminal word produced from W. (Note: in all our balanced examples we
can take \(C = 1\) and \(K \in \{0, 1\}\).)

Lemma 2. Let \(\mathcal{G}\) be an unambiguous balanced indexed grammar for some
non-empty language \(\mathcal{L}\) that involves only one index symbol (say f) and suppose V is
the only non-terminal that loads f. Then in the generalized DSV equations for \(\mathcal{G}\),
the sequence of functions \(V_n(z)\) can be eliminated (pushed to infinity), where \(n \ge 0\)
refers to the number of f indices on V. Under these hypotheses, the system of
equations defining \(S(z)\) reduces to finitely many bounded recurrences with initial
conditions whose solution is the growth function for \(\mathcal{L}\).

Proof. Consider all productions in \(\mathcal{G}\) that have V on the left side. Converting these
productions into the usual functional equations gives an equation of form either

\[V_n(z) = V_{n+1}(z) + W_{n \pm e}(z) \quad\text{or}\quad V_n(z) = V_{n+1}(z)\,W_{n \pm e}(z)\]

(here \(W_{n \pm e}(z)\) denotes an expression that represents all other grammar productions
having V on the left side while \(e \in \{0, 1, -1\}\)). Consider the equation
\(V_n(z) = V_{n+1}(z) + W_{n \pm e}(z)\). Fix \(e = 0\) for the remainder of this paragraph. Starting with
\(n = 0\) and iterating \(N > 0\) times yields
\(V_0(z) = V_N(z) + W_0(z) + W_1(z) + \cdots + W_{N-1}(z)\).
By the balanced hypothesis, there exist constants \(C, K \ge 0\) such that
all terminal words produced from \(V_{f^N\$}\) have length at least \(N/C - K \ge 0\). This
means that the first \(N/C - K\) terms in the ordinary generating function for \(V_0(z)\)
are unaffected by the contributions from \(V_N(z)\) and depend only on the fixed sum
\(W_0(z) + W_1(z) + \cdots + W_{N-1}(z)\). Therefore the \((N/C - K)\)th partial sum defining
the generating function for \(V_0(z)\) is stabilized as soon as the iteration above reaches
\(V_N(z)\). This is true for all big N, so we may take the limit as \(N \to \infty\) and express

\[V_0(z) = \sum_{n \ge 0} W_n(z).\]

If the equation for loading of index f takes form \(V_n(z) = V_{n+1}(z)W_n(z)\), a
similar argument derives \(V_0(z) = \prod_{n \ge 0} W_n(z)\). Allowing \(e = \pm 1\) in either form
merely shifts indices in the sum/product and does not affect the logic of the
argument. Therefore, in all cases the variables \(V_n(z)\) are eliminated from the system of
equations induced by the grammar \(\mathcal{G}\). We assumed that V was the only variable
loading indices, so all other grammar variables either unload/pop indices
(yielding finitely many finite recurrences of bounded depth) or evaluate as terminals
(supplying recurrence initial conditions). The solution of this simplified system is
\(S(z)\). □

It turns out that the balanced hypothesis used above is automatically satisfied by
a reduced, \(\varepsilon\)-free grammar of this kind.

Lemma 3. Suppose the indexed grammar \(\mathcal{G}\) is unambiguous, reduced, \(\varepsilon\)-free, and
has only one index symbol f (other than $). Then \(\mathcal{G}\) is balanced.

Proof. If the language produced by \(\mathcal{G}\) is finite, there is nothing to prove so we
may assume the language is infinite. Let us define a special sequence of grammar
productions used in producing a terminal word. Suppose a sentential form contains
several variables, each of which is ready for unloading of the index symbol f. A
step will consist of unloading a single f from each of these non-terminals, starting
from the leftmost variable. After the step, each of these variables will hold an index
string that is exactly one character shorter than before.

Consider a sentential form F containing one or more non-terminals in the
unloading stage and each of whose index strings is of length at least \(N \gg 0\). These
symbols can only be unloaded at the rate of one per production rule (this is the
reduced hypothesis) and we'll only consider changing F by steps.

On the other hand, there are only finitely many production rules to do this
unloading of finitely many variables. Thus for large N there is a cycle of production
rules as indices are unloaded, with cycle length bounded by a global constant \(C > 0\)
which depends only on the grammar. Furthermore, this cycle is reached after at
most \(K < C\) many productions. Let \(F'\) denote the sentential form that results
from F as one such cycle is begun and let \(F''\) be the sentential form after the cycle
is applied to \(F'\).

Consider lengths of sentential forms (counting terminals and variables but
ignoring indices). Since the grammar is reduced and \(\varepsilon\)-free, each grammar production
is a non-decreasing function of lengths. Thus \(|F''| \ge |F'| \ge |F|\). Discounting any
indices, the equality of sentential forms \(F'' = F'\) is not possible because this implies
ambiguity.

We claim that either \(F''\) is longer than \(F'\) or that \(F''\) has more terminals than \(F'\).
If not, then \(F''\) has exactly the same terminals as \(F'\), and each has the same quantity
of variables. There are only finitely many arrangements of terminals and variables
for this length and for large N we may loop stepwise through the production cycle
arbitrarily often and thus repeat exactly a sentential form (discounting indices).
This implies our grammar is ambiguous contrary to hypothesis.

Thus after C steps the sentential forms obtain either at least one new terminal
or at least one new non-terminal. In the latter case, variables must convert into
terminals on $ (via the reduced, unambiguous, \(\varepsilon\)-free hypotheses). There will be at
least one terminal per step in the final output word w. We obtain the inequality
\((N - K)/C \le |w|\) which establishes the lemma. □

2.3. A collection of examples. We illustrate the method, its power and its
limitations with three examples: the first two are examples with intermediate growth,
and the third is one which we are unable to resolve to our satisfaction.

The first example is originally due to [12] and features the indexed language
\(\mathcal{L}_{G/M} = \{ab^{i_1}ab^{i_2} \cdots ab^{i_k} : 0 \le i_1 \le i_2 \le \cdots \le i_k\}\) with intermediate growth
(meaning that the number of words of length n ultimately grows faster than any
polynomial in n but more slowly than \(2^{kn}\) for any constant \(k > 0\)). The question
of whether a context-free language could have this property was asked in [J and
answered in the negative [15] [3]. Grigorchuk and Machi constructed their
language based on the generating function of Euler's partition function. A word of
length n encodes a partition sum of n. For instance, the partitions of n = 5 are
1+1+1+1+1, 1+1+1+2, 1+2+2, 1+1+3, 2+3, 1+4, 5. The corresponding
words in \(\mathcal{L}_{G/M}\) are aaaaa, aaaab, aabab, aaabb, ababb, aabbb, abbbb, respectively.
The derivation below is ours.

Example 4. An unambiguous grammar for \(\mathcal{L}_{G/M}\) is

\(S \to T_{\$} \qquad T \to T_f \,|\, GT \,|\, G \qquad G_f \to Gb \qquad G_{\$} \to a\)

The latter two productions imply that \(G_{f^m\$} \overset{*}{\to} ab^m\) or in terms of functions
\(G_m(z) = z^{m+1}\). A typical parse tree is illustrated in Figure 2.1.
The second grammar production group transforms to

\[T_m(z) = T_{m+1}(z) + G_m(z)T_m(z) + G_m(z).\]

Substitution and solving for \(T_m\) gives

\[T_m(z) = \frac{z^{m+1} + T_{m+1}(z)}{1 - z^{m+1}}.\]

Figure 2.1. A typical parse tree in the grammar

\(S \to T_{\$} \qquad T \to T_f \,|\, GT \,|\, G \qquad G_f \to Gb \qquad G_{\$} \to a\)



Iterating this recurrence yields a kind of inverted continued fraction:

\[S(z) = T_0(z) = \frac{z + T_1(z)}{1 - z} = \frac{z + \dfrac{z^2 + T_2(z)}{1 - z^2}}{1 - z} = \frac{z + \dfrac{z^2 + \dfrac{z^3 + T_3(z)}{1 - z^3}}{1 - z^2}}{1 - z} = \cdots\]

Equivalently, this recurrence can be represented as

\[\frac{z + T_1(z)}{1 - z} = \frac{z(1 - z^2) + z^2 + T_2(z)}{(1 - z)(1 - z^2)} = \frac{z(1 - z^2)(1 - z^3) + z^2(1 - z^3) + z^3 + T_3(z)}{(1 - z)(1 - z^2)(1 - z^3)}\]

or

\[S(z) = \frac{z}{1 - z} + \frac{z^2}{(1 - z)(1 - z^2)} + \frac{z^3}{(1 - z)(1 - z^2)(1 - z^3)} + \cdots + \frac{z^k + T_k(z)}{\prod_{n=1}^{k}(1 - z^n)}.\]

Apply Lemma 2 to push \(T_k(z)\) off to infinity:

\[S(z) = \sum_{n \ge 1} \frac{z^n}{\prod_{k=1}^{n}(1 - z^k)}.\]

Here we recognize a classic combinatorial summation of partitions in terms of their
largest part O Example 1.7] . Thus, we have recovered the ordinary generating
function for partitions, \(S(z) = \sum_{n \ge 1} p(n)z^n\), where the coefficients belong to Euler's
partition sequence \(p(n)\). Since we can also write

\[S(z) = \sum_{n \ge 1} p(n)z^n = -1 + \prod_{n \ge 1} \frac{1}{1 - z^n},\]

it is true that \(S(z)\) has a dense set of singularities on the unit circle and is not
D-finite.
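The recurrence of Example 4 can be checked numerically. The Python sketch below is an illustration only: the function names and the truncation degree `N` are our assumptions. It unwinds T_m(z) = (z^(m+1) + T_(m+1)(z)) / (1 - z^(m+1)) backwards from T_N, which vanishes modulo z^(N+1) because every word derived from T with N indices has length at least N + 1, and compares the result with partition numbers computed by an independent standard recurrence.

```python
def partitions_via_grammar(N):
    """Coefficients of S(z) = T_0(z) up to z^N from the recurrence of Example 4."""
    T = [0] * (N + 1)                       # T_N(z) is zero modulo z^(N+1)
    for m in range(N - 1, -1, -1):
        k = m + 1
        T[k] += 1                           # add the numerator term z^(m+1)
        for i in range(k, N + 1):           # divide by (1 - z^k), in place
            T[i] += T[i - k]
    return T


def partition_numbers(N):
    """p(0..N) by the usual part-by-part counting, independent of the grammar."""
    p = [1] + [0] * N
    for part in range(1, N + 1):
        for n in range(part, N + 1):
            p[n] += p[n - part]
    return p


if __name__ == "__main__":
    print(partitions_via_grammar(10))
```

The two lists agree from degree 1 onward; the constant terms differ because the grammar does not produce the empty word.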

Example 5. Another series with intermediate growth can be realized as the
ordinary generating function of the following indexed grammar:

\(S \to C \,|\, CT_{\$} \qquad C \to bC \,|\, \varepsilon \qquad T \to T_f \,|\, W \qquad W_f \to VWX\)
\(V_f \to aaV \qquad V_{\$} \to aa \qquad W_{\$} \to a \qquad X \to a \,|\, b\)

As usual, we use index $ to indicate the bottom of the stack, with f being the only
actual index symbol. A typical parse tree is given in Figure 2.2. From this we see
that the language generated (unambiguously) is

\[\mathcal{L}_{int} = \{b^*(\varepsilon \,|\, a^{n^2+n+1}(a|b)^n) : n \ge 0\}.\]

We can derive the generating function in the usual fashion. Note the shortcuts
\(X_{f^n\$} \to (a|b)\) (regardless of indices) and \(V_{f^n\$} \overset{*}{\to} a^{2n}V_{\$} \to a^{2n+2}\). Starting with

\[W_{ff\$} \to V_{f\$}W_{f\$}X_{f\$} \overset{*}{\to} a^4W_{f\$}(a|b) \to a^4V_{\$}W_{\$}X_{\$}(a|b) \overset{*}{\to} a^4a^2a(a|b)^2\]

one can use induction to derive

\[W_{f^n\$} \overset{*}{\to} a^{2n} \cdots a^4a^2a(a|b)^n = a^{n(n+1)}a(a|b)^n = a^{n^2+n+1}(a|b)^n.\]

In terms of generating functions these shortcuts imply \(W_n(z) = z^{n^2+n+1}2^nz^n = 2^nz^{(n+1)^2}\);
also \(C(z) = \frac{1}{1-z}\). Put this all together to get

\[S(z) = C(z) + C(z)T_0(z) = C(1 + T_0) = C(1 + T_1 + W_0) = C(1 + T_2 + W_0 + W_1) = \cdots = C\Big(1 + \sum_{n \ge 0} W_n\Big) = \frac{1}{1-z}\Big(1 + \sum_{n=0}^{\infty} 2^nz^{(n+1)^2}\Big).\]

Write as a sum of rational functions and expand each geometric series:

\[
\begin{array}{rllllllllll}
S(z) = & 1 & +\,z & +\,z^2 & +\,z^3 & +\,z^4 & +\,z^5 & +\,z^6 & +\,z^7 & +\,z^8 & +\,z^9 + \cdots \\
 & & +\,z & +\,z^2 & +\,z^3 & +\,z^4 & +\,z^5 & +\,z^6 & +\,z^7 & +\,z^8 & +\,z^9 + \cdots \\
 & & & & & +\,2z^4 & +\,2z^5 & +\,2z^6 & +\,2z^7 & +\,2z^8 & +\,2z^9 + \cdots \\
 & & & & & & & & & & +\,4z^9 + \cdots
\end{array}
\]

and so forth. Sum the columns and observe that the coefficient of each \(z^n\) is a
power of 2, with new increments occurring when n is a perfect square. Thus

\[S(z) = \sum_{n=0}^{\infty} 2^{\lfloor\sqrt{n}\rfloor}z^n\]

and the coefficient of \(z^n\) grows faster than any polynomial (as \(n \to \infty\)) but is
sub-exponential.
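The column sums in Example 5 are easy to confirm by machine. The following Python sketch (function name and truncation degree are our own assumptions) builds the numerator 1 + Σ 2^n z^((n+1)^2), multiplies by 1/(1 - z) through prefix sums, and compares with the closed form for the coefficients.

```python
from math import isqrt


def coeffs_int(N):
    """Coefficients up to z^N of (1 + sum_n 2^n z^((n+1)^2)) / (1 - z)."""
    a = [0] * (N + 1)
    a[0] = 1
    n = 0
    while (n + 1) ** 2 <= N:
        a[(n + 1) ** 2] += 2 ** n           # the term W_n(z) = 2^n z^((n+1)^2)
        n += 1
    for i in range(1, N + 1):               # multiplying by 1/(1 - z) takes prefix sums
        a[i] += a[i - 1]
    return a


if __name__ == "__main__":
    print(coeffs_int(16))
```

Each new perfect square doubles the running total, which is the source of the growth rate between polynomial and exponential.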

The indexed grammars used in applications (combings of groups, combinatorial 
descriptions, etc) tend to be reasonably simple and most use one index symbol. 
Despite the success of our many examples, Lemma 2 does not guarantee that an
explicit closed formula for \(S(z)\) can always be found.

Example 6. Consider the balanced grammar below.

\(S \to T_{\$} \qquad T \to T_f \,|\, N \qquad N_f \to aN \,|\, b^2M \qquad M_f \to abNM \qquad M_{\$}, N_{\$} \to \varepsilon\)

The hypotheses of Lemma 2 are satisfied so we can push T to infinity and obtain

\[S(z) = T_0 = N_0 + T_1 = N_0 + N_1 + T_2 = \cdots = \sum_{n \ge 0} N_n(z).\]

However, the recursions defining \(N_n(z)\) are intertwined and formidable:

\[N_n(z) = zN_{n-1} + z^2M_{n-1} \quad\text{and}\quad M_n(z) = z^2N_{n-1}M_{n-1} \qquad \forall n \ge 1\]

with \(N_0 = 1 = M_0\). It is possible to eliminate M but the resulting nonlinear
recursion

\[N_n(z) = zN_{n-1} + z^2N_{n-1}N_{n-2} - z^3N_{n-2}^2\]

does not appear to be a bargain (it is possible that a multivariate generating
function as per Example 1141 may be helpful). The reader is invited to invent grammars
of this type with more difficult recursions.
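Although no closed form is offered for Example 6, the coupled recursions are straightforward to iterate with exact polynomial arithmetic. The Python sketch below is only an illustration (polynomials as coefficient lists; all helper names are our assumptions). It builds N_n and M_n from the two coupled recursions, and the same data can be used to confirm the eliminated nonlinear recursion for n ≥ 2.

```python
def padd(p, q):
    """Add two polynomials given as coefficient lists."""
    n = max(len(p), len(q))
    p, q = p + [0] * (n - len(p)), q + [0] * (n - len(q))
    return [a + b for a, b in zip(p, q)]


def pmul(p, q):
    """Multiply two polynomials exactly."""
    r = [0] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            r[i + j] += a * b
    return r


def shift(p, k):
    """Multiply a polynomial by z^k."""
    return [0] * k + p


def example6(n_max):
    """Lists [N_0..N_n_max] and [M_0..M_n_max] from the coupled recursions."""
    N, M = [[1]], [[1]]
    for _ in range(n_max):
        n_prev, m_prev = N[-1], M[-1]
        N.append(padd(shift(n_prev, 1), shift(m_prev, 2)))   # N_n = z N_{n-1} + z^2 M_{n-1}
        M.append(shift(pmul(n_prev, m_prev), 2))             # M_n = z^2 N_{n-1} M_{n-1}
    return N, M


def growth(N, degree):
    """Coefficients of sum_n N_n(z) up to z^degree; needs len(N) > degree."""
    S = [0] * (degree + 1)
    for poly in N:
        for d, c in enumerate(poly[: degree + 1]):
            S[d] += c
    return S
```

Since N_n has no terms of degree below n, summing N_0 through N_d already gives the exact coefficients of S(z) up to degree d. Keep `n_max` small: the degrees of the M_n grow quickly.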
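Figure 3.1. A typical parse tree in the grammar

\(S \to T_{\$} \qquad T \to T_g \,|\, U_f \qquad U \to U_f \,|\, VR \,|\, V \qquad R \to VR \,|\, V\)
\(V \to aBC \qquad B_f \to Bb \qquad B_g \to \varepsilon \qquad C_f \to Cc \qquad C_g \to cC \qquad C_{\$} \to \varepsilon\)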



3. The case of several index symbols 
Our next example uses two index symbols (in addition to $) in an essential way. 

Example 7. Define \(\mathcal{L}_{serial} = \{(ab^ic^j)^k : 1 \le i \le j\}\). Consider the grammar:

\(S \to T_{\$} \qquad T \to T_g \,|\, U_f \qquad U \to U_f \,|\, VR \,|\, V \qquad R \to VR \,|\, V\)
\(V \to aBC \qquad B_f \to Bb \qquad B_g \to \varepsilon \qquad C_f \to Cc \qquad C_g \to cC \qquad C_{\$} \to \varepsilon\)

Observe that the two index symbols are loaded serially: all g's are loaded prior
to any f so each valid index string will be of the form \(f^*g^*\$\). We also have the
shortcuts \(C_{f^mg^n\$} \overset{*}{\to} c^mC_{g^n\$} \overset{*}{\to} c^{m+n}\) and \(B_{f^mg^n\$} \overset{*}{\to} b^mB_{g^n\$} \to b^m\varepsilon = b^m\)
and consequently \(V_{f^mg^n\$} \to aB_{f^mg^n\$}C_{f^mg^n\$} \overset{*}{\to} ab^mc^{m+n}\). A typical parse tree
is given in Figure 3.1.

The special form of the index strings ensures that such a string is uniquely
identified solely by the number of f's and number of g's it carries. Consequently, the
induced function \(V_{f^mg^n\$}(z)\) can be relabeled more simply as \(V_{m,n}(z)\), and
similarly with functions T, U, R. Working in the reverse order of the listed grammar
productions, we have the identities

\[V_{m,n}(z) = z^{2m+n+1} \quad\text{and}\quad R_{m,n}(z) = \frac{z^{2m+n+1}}{1 - z^{2m+n+1}}.\]

The grammar production \(U \to U_f \,|\, VR \,|\, V\) implies for fixed \(n \ge 0\) that

\[U_{m,n}(z) = U_{m+1,n}(z) + V_{m,n}(z)R_{m,n}(z) + V_{m,n}(z).\]

The hypotheses of Lemma 2 are satisfied in that we are dealing with a balanced
grammar where currently only one index symbol is being loaded onto one variable.
Therefore for fixed n we can push \(U_{m,n}(z)\) off to infinity and obtain

\[U_{1,n}(z) = \sum_{m \ge 1}\big(V_{m,n}R_{m,n} + V_{m,n}\big) = \sum_{m \ge 1} R_{m,n}(z) = \sum_{m \ge 1} \frac{z^{2m+n+1}}{1 - z^{2m+n+1}}.\]

Our general derivation proceeds as follows:

\[S(z) = T_{0,0} = T_{0,1} + U_{1,0} = T_{0,2} + U_{1,1} + U_{1,0} = \cdots = T_{0,k+1} + \sum_{n=0}^{k} U_{1,n}.\]

Lemma 2 can be invoked again to eliminate T. We find that

\[S(z) = \sum_{n \ge 0}\sum_{m \ge 1} \frac{z^{2m+n+1}}{1 - z^{2m+n+1}} = \sum_{j \ge 1} \frac{z^{3j}}{(1 - z^j)(1 - z^{2j})},\]

with the latter summation realized by expanding geometric series and
changing the order of summation in the double sum. In any event, \(S(z)\) has
infinitely many singularities on the unit circle and is not D-finite.

It is worth noting the reason why the copying schema used above succeeds in
indexed grammars (but fails for context-free grammars). The word \(ab^ic^j\) is first
encoded as an index string attached to V and only then copied to VR (the grammar
symbol R is a mnemonic for "replicator"). This ensures that \(ab^ic^j\) is faithfully
copied. Slogan: "encode first, then copy". Context-free grammars are limited to
"copy first, then express" which does not allow for fidelity in copying.

We would like to generalize the previous example. The key notion was the 
manner in which the indices were loaded. 

Definition. Suppose that \(\mathcal{G}\) is an unambiguous indexed grammar with index
alphabet \(I = \{f_1, f_2, \dots, f_n\}\) such that every index string \(\sigma\) has form
\(f_n^*f_{n-1}^* \cdots f_1^*\$\). We say the indices are loaded serially in \(\mathcal{G}\).

Corollary 8. Assume \(\mathcal{G}\) is an unambiguous balanced indexed grammar with
variable set V, terminal alphabet \(\Sigma\), and index alphabet \(I = \{f_1, \dots, f_n\}\). Suppose
all indices are loaded serially onto respective variables \(T^{(1)}, \dots, T^{(n)}\) and in the
indicated order. Then each function family \(T^{(j)}(z)\) can be eliminated ("pushed to
infinity") and the system of equations defining \(S(z)\) can be reduced to finitely many
recursions as per the conclusion of Lemma 2.

Proof. We have assumed that \(T^{(n)}\) is the last variable to load indices and is loaded
with \(f_n\) only, so \(T^{(n)}\) carries an index string \(\sigma\) of form \(f_n^* \cdots \$\). Indeed, any
grammar production having \(T^{(n)}\) on the left side will have a right side of two types:
a string \(U_1 \in (V|\Sigma)^*\) that loads \(f_n\) onto \(T^{(n)}\) or a string \(U_2 \in (V|\Sigma)^*\) without any
loading of indices. Neither of these types will include any variable \(T^{(j)}\) for \(j < n\)
(by the serial loading hypothesis) nor will there be any unloading of indices (by
the unambiguous hypothesis). Consequently, the equations having \(T^{(n)}(z)\) on the
left side have on their right side products and sums involving no \(T^{(j)}(z)\) for \(j < n\)
but only \(T^{(n)}(z)\) and functions that define finite recurrences. The hypotheses of
Lemma 2 apply to this situation and \(T^{(n)}\) can be pushed to infinity.

The previous paragraph is both the basis step and induction step of an obvious
argument that eliminates \(T^{(n-1)}\), then \(T^{(n-2)}\), and so on till \(T^{(1)}\). □

In the case of a grammar with multiple index symbols, we would like to be able to
replace an unwieldy expression like \(A_{gfghgf\$}(z)\) with \(A_{2,3,1}(z)\) where the subscripts
indicate two occurrences of f, three of g, and one h. This is certainly possible for
a grammar with only one index symbol (excluding the end of stack marker $) or
several symbols loaded serially as per the previous Corollary, but is not possible in
general.

Example 9. (Ordering matters) Consider the language \(\mathcal{L}_{ord}\) generated by the
indexed grammar below.

\(S \to T_{\$} \qquad T \to T_\alpha \,|\, T_\beta \,|\, N \qquad N_\alpha \to aN \qquad N_\beta \to bNbNb \qquad N_{\$} \to \varepsilon\)

When applying the DSV transformations to this grammar we would like to write
\(N_{1,1}(z)\) as the formal power series corresponding to the grammar variable N with
any index string having one \(\alpha\) index and one \(\beta\) index, followed by the end of stack
marker $. Note the derivations \(S \overset{*}{\to} N_{\alpha\beta\$} \overset{*}{\to} abbb\) and \(S \overset{*}{\to} N_{\beta\alpha\$} \overset{*}{\to} babab\).
Even though both intermediate sentential forms have one of each index symbol,
followed by the end of stack marker $, they produce distinct words of differing
length. Thus using subscripts to indicate the quantity of stack indices cannot work
in general without some consideration of index ordering.

We note that the grammar is reduced and balanced. It is also unambiguous,
which can be verified by induction. In fact, if \(\sigma \in (\alpha|\beta)^*\$\) is an index string such
that \(N_\sigma \overset{*}{\to} w\) where w is a terminal word of length n, then \(N_{\alpha\sigma} \overset{*}{\to} aw\) and
\(N_{\beta\sigma} \overset{*}{\to} bwbwb\) where \(|aw| = n + 1\) and \(|bwbwb| = 2n + 3\). Suppose that all words
\(w \in \mathcal{L}_{ord}\) of length n or less are produced unambiguously. Consider a word v of
length n + 1. Either \(v = aw\) or \(v = bw'bw'b\) for some shorter words \(w, w' \in \mathcal{L}_{ord}\)
that were produced unambiguously by hypothesis. Clearly these two forms
for v cannot be confused since one starts with a and the other with b.

The proof of Lemma 2 can be applied to eliminate the \(T_\sigma(z)\). Solving for \(S(z)\)
via the generalized DSV procedure gives

\[S(z) = \sum_\sigma N_\sigma(z)\]

where the sum is over all index strings \(\sigma\). It is infeasible to simplify further because
the number of grammar functions \(N_\sigma(z)\) grows exponentially in the length of \(\sigma\)
without suitable simplifying recursions.
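The dependence on index order in Example 9 can be seen by expanding N over explicit index strings. In the Python sketch below, which is an illustration only, the letters `"A"` and `"B"` stand for the indices α and β, the trailing $ is left implicit, and the function names are our assumptions. The brute-force enumeration is exponential in the stack height, in line with the remark that the functions N_σ do not consolidate.

```python
from itertools import product


def expand(sigma):
    """Terminal word derived from N carrying index string sigma (top of stack first)."""
    if not sigma:
        return ""                                   # N_$ -> epsilon
    w = expand(sigma[1:])                           # the rest of the stack is copied
    return "a" + w if sigma[0] == "A" else "b" + w + "b" + w + "b"


def growth_ord(N):
    """Number of words of each length 0..N, by enumerating index strings."""
    counts = [0] * (N + 1)
    for height in range(N + 1):                     # a word is at least as long as its stack
        for sigma in product("AB", repeat=height):
            n = len(expand(sigma))
            if n <= N:
                counts[n] += 1
    return counts


if __name__ == "__main__":
    print(expand("AB"), expand("BA"))
    print(growth_ord(10))
```

The two index strings with one α and one β produce words of lengths 4 and 5, so a Parikh-style subscript cannot label a single function here.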

The previous example showed two non-terminals having index strings with the 
same quantity of respective symbols but in different orders leading to two distinct 
functions. We can define a condition that ensures such functions are the same. 

Definition. Let \(A = \{a_1, a_2, \dots, a_n\}\) denote a finite alphabet. The Parikh vector
associated to \(\sigma \in A^*\) records the number of occurrences of each \(a_i\) in \(\sigma\) as \(x_i\) in the
vector \([x_1, x_2, \dots, x_n]\) (see [19]). Define two strings \(\sigma, \tau \in A^*\) to be Parikh equivalent
if they map to the same Parikh vector (in other words \(\tau\) is a permutation of \(\sigma\)).
When A is the index alphabet for a grammar, we extend this idea to non-terminals
and say \(V_\sigma\) is Parikh equivalent to \(V_\tau\) if \(\sigma\) and \(\tau\) map to the same Parikh vector.

The following lemma gives sufficient conditions that allow simplifying the
ordering difficulty for function subscripts. Its proof is immediate.

Lemma 10. Assume \(\mathcal{G}\) is a nontrivial balanced indexed grammar. Suppose each
pair of Parikh equivalent index strings \(\sigma, \tau\) appended to a given grammar variable
V result in identical induced functions \(V_\sigma(z) = V_\tau(z)\). Then the functions induced
from V can be consolidated into equivalence classes (where we replace index string
subscripts by their respective Parikh vectors) without changing the solution \(S(z)\) of
the system of DSV equations.

We have already used this lemma in Example 7 above. We illustrate with another
example.

Example 11. Consider the non-context-free language \(\mathcal{L}_{double} = \{ww : w \in (a|b)^*\}\)
produced by the grammar

\(S \to T_{\$} \qquad T \to T_\alpha \,|\, T_\beta \,|\, RR \qquad R_\alpha \to aR \qquad R_\beta \to bR \qquad R_{\$} \to \varepsilon.\)

Suppose \(\sigma, \tau \in (\alpha|\beta)^*\$\) are Parikh equivalent index strings of length n. It is clear
that \(R_\sigma(z) = z^n = R_\tau(z)\). In fact, every string \(u \in (\alpha|\beta)^n\$\) implies \(R_u(z) = z^n\),
regardless of the particular distribution of \(\alpha\) and \(\beta\) in u. Instead of using the
equivalence classes \(R_{i,j}(z)\) where \([i, j]\) is the Parikh vector for u, let \(R_n(z)\) denote
the equivalence class of all such induced functions \(R_u(z)\) where \(u \in (\alpha|\beta)^n\$\) and
define \(T_n(z)\) similarly. We will abuse notation and refer to the elements of these
classes as \(R_n(z)\) or \(T_n(z)\), respectively. The grammar equations become

\[S(z) = T_0 = 2T_1 + R_0^2 = R_0^2 + 2(2T_2 + R_1^2) = R_0^2 + 2R_1^2 + 4R_2^2 + \cdots\]

where we can push the \(T_n(z)\) to infinity as per the proof of Lemma 2. Therefore

\[S(z) = \sum_{n \ge 0} 2^nR_n^2(z) = \sum_{n \ge 0} 2^nz^{2n} = \frac{1}{1 - 2z^2}.\]
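The closed form in Example 11 can be compared with a direct count of squares ww. The Python sketch below is an illustration with assumed function names; the brute-force count is exponential in the word length and is meant only for small lengths.

```python
from itertools import product


def count_squares(L):
    """Number of words of length L over {a, b} of the form ww."""
    if L % 2:
        return 0
    half = L // 2
    return sum(1 for w in product("ab", repeat=L) if w[:half] == w[half:])


def series_double(N):
    """Coefficients of 1 / (1 - 2 z^2) up to z^N."""
    c = [0] * (N + 1)
    c[0] = 1
    for n in range(2, N + 1):
        c[n] = 2 * c[n - 2]
    return c


if __name__ == "__main__":
    print([count_squares(L) for L in range(9)])
    print(series_double(8))
```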

4. Further examples related to number theory 
In addition to our example from [12, we have the following 

Example 12. Define 2div = {a" (6")* : n > O} which is generated by the unam- 
biguous balanced grammar 

S ^ T$ T ^ T/ 1 A/R/ R ^ BR|£ 
A/ — > a A A$ ^ e B/ -> &B B$ e 

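We see some familiar shortcuts: \(\mathbf{A}_{f^n\$} \overset{*}{\to} a^n\) and \(\mathbf{B}_{f^n\$} \overset{*}{\to} b^n\). In terms of functions
this means \(A_n(z) = z^n = B_n(z)\) and furthermore \(R_n = B_nR_n + 1\) implies
\(R_n(z) = \frac{1}{1-z^n}\). Thus our main derivation becomes

\[ S(z) = T_0 = T_1 + A_1R_1 = T_2 + A_1R_1 + A_2R_2 = \cdots = \sum_{n\ge 1} A_nR_n = \sum_{n\ge 1} \frac{z^n}{1-z^n} . \]

Expand each rational summand into a geometric series and collect terms

\[
\begin{array}{cccccccccc}
z & +z^2 & +z^3 & +z^4 & +z^5 & +z^6 & +z^7 & +z^8 & +z^9 & +\cdots \\
 & +z^2 & & +z^4 & & +z^6 & & +z^8 & & +\cdots \\
 & & +z^3 & & & +z^6 & & & +z^9 & +\cdots \\
 & & & +z^4 & & & & +z^8 & & +\cdots \\
 & & & & +z^5 & & & & & +\cdots \\
 & & & & & \ddots & & & &
\end{array}
\]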

We see the table houses a sieve of Eratosthenes and we find that

\[ S(z) = z + 2z^2 + 2z^3 + 3z^4 + 2z^5 + 4z^6 + \cdots = \sum_{n\ge 1} \tau(n) z^n \]

where \(\tau(n)\) is the number of positive divisors of n. Again, \(S(z)\) has infinitely many
singularities on the unit circle and is not D-finite.
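
The column sums of the table can be checked mechanically. The sketch below
(function names are ours, not the paper's) adds the rows \(z^n + z^{2n} + z^{3n} + \cdots\)
up to a chosen degree and compares the result with a direct divisor count.

```python
def lambert_coefficients(max_degree):
    """Coefficients of sum_{n>=1} z^n / (1 - z^n), truncated at z^max_degree."""
    coeffs = [0] * (max_degree + 1)
    for n in range(1, max_degree + 1):
        # Row n of the table: z^n + z^(2n) + z^(3n) + ...
        for multiple in range(n, max_degree + 1, n):
            coeffs[multiple] += 1
    return coeffs

def divisor_count(n):
    """tau(n): number of positive divisors of n, by trial division."""
    return sum(1 for d in range(1, n + 1) if n % d == 0)

coeffs = lambert_coefficients(30)
assert all(coeffs[n] == divisor_count(n) for n in range(1, 31))
print(coeffs[1:7])  # [1, 2, 2, 3, 2, 4]
```

A column of the table receives one entry from row n exactly when n divides the
exponent, which is why the sieve picture and the divisor count coincide.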

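Example 13. Let \(\mathcal{L}_{comp} = \{0^c : c \text{ is composite}\}\) denote the composite numbers
written in unary. A generative grammar is

\[ \mathbf{S} \to \mathbf{T}_{f\$} \qquad \mathbf{T} \to \mathbf{T}_f \mid \mathbf{R} \qquad \mathbf{R} \to \mathbf{R}\mathbf{A} \mid \mathbf{A}\mathbf{A} \qquad \mathbf{A}_f \to 0\mathbf{A} \qquad \mathbf{A}_{\$} \to 0 \]

with sample derivation

\[ \mathbf{S} \overset{*}{\to} \mathbf{R}_{f^n\$} \overset{*}{\to} \left(\mathbf{A}_{f^n\$}\right)^{m+1} \overset{*}{\to} 0^{(n+1)(m+1)} . \]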

This is certainly an ambiguous grammar because there is a distinct production
of \(0^c\) for each nontrivial factorization of c. (Note: one can tweak the grammar
to allow the trivial factorizations \(1 \cdot c\) and \(c \cdot 1\). The resultant language becomes
the semigroup \(0^+\) isomorphic to \(\mathbb{Z}^+\) but the generating function of all grammar
productions is the familiar \(\sum \tau(n) z^n\) which we saw in Example 12.)
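
To make the ambiguity concrete, the sketch below counts the derivations of
\(0^c\) under the assumption that each one corresponds to an ordered
factorization \(c = (n+1)(m+1)\) with \(n, m \ge 1\), as in the sample derivation.
The function names are illustrative only.

```python
def derivation_count(c):
    """Number of pairs (n, m), n, m >= 1, with (n + 1)(m + 1) = c.

    The first factor n + 1 ranges over the divisors of c strictly between
    1 and c; the second factor is then forced.
    """
    return sum(1 for first in range(2, c) if c % first == 0)

def divisor_count(c):
    """tau(c): number of positive divisors of c."""
    return sum(1 for d in range(1, c + 1) if c % d == 0)

# Excluding the trivial factorizations 1*c and c*1 removes two divisors.
assert all(derivation_count(c) == divisor_count(c) - 2 for c in range(2, 200))

# Primes have no derivation; composites have at least one.
print([c for c in range(2, 20) if derivation_count(c) == 0])
```

Under this reading the count is \(\tau(c) - 2\), so the grammar is ambiguous as
soon as c has more than one nontrivial ordered factorization, for example c = 6.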

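Suppose we want the generating function for the sum of positive divisors, \(\sum \sigma(n) z^n\)?
Then our table expansion above would look like

\[
\begin{array}{rlllllll}
S(z) = & z & +2z^2 & +3z^3 & +4z^4 & +5z^5 & +6z^6 & +\cdots \\
 & & +z^2 & & +2z^4 & & +3z^6 & +\cdots \\
 & & & +z^3 & & & +2z^6 & +\cdots \\
 & & & & +z^4 & & & +\cdots \\
 & & & & & \ddots & & \\
 = & z & +3z^2 & +4z^3 & +7z^4 & +\cdots & &
\end{array}
\]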
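and now each row has closed form \(\frac{z^n}{(1-z^n)^2}\). Our goal is to modify the grammar
of the previous example to obtain \(\sum_{n\ge 1} \frac{z^n}{(1-z^n)^2}\). At first glance it would seem that
we only need replace the grammar rule \(\mathbf{T} \to \mathbf{T}_f \mid \mathbf{A}_f\mathbf{R}_f\) with \(\mathbf{T} \to \mathbf{T}_f \mid \mathbf{A}_f\mathbf{R}_f\mathbf{R}_f\),
which replaces \(S(z) = \sum A_nR_n\) with \(\sum A_nR_nR_n\). However this creates ambiguity
because we can produce ab in two ways: \(\mathbf{S} \overset{*}{\to} \mathbf{A}_{f\$}\mathbf{R}_{f\$}\mathbf{R}_{f\$} \overset{*}{\to} a\,b\,\varepsilon\) and
\(\mathbf{S} \overset{*}{\to} \mathbf{A}_{f\$}\mathbf{R}_{f\$}\mathbf{R}_{f\$} \overset{*}{\to} a\,\varepsilon\,b\). An unambiguous solution is to create copies \(\mathbf{U}\) and \(\mathbf{C}\) of the
original grammar variables \(\mathbf{R}\) and \(\mathbf{B}\), respectively, so that \(\mathbf{T} \to \mathbf{T}_f \mid \mathbf{A}_f\mathbf{R}_f\mathbf{U}_f\) is
the replacement and we add new rules \(\mathbf{U} \to \mathbf{U}\mathbf{C} \mid \varepsilon\), \(\mathbf{C}_f \to c\mathbf{C}\), and \(\mathbf{C}_{\$} \to \varepsilon\). The
details are left to the reader.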
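
A quick numerical check of the target series is sketched below. It assumes
that the modified grammar generates the words \(a^n (b^n)^j (c^n)^k\) with \(n \ge 1\)
and \(j, k \ge 0\); that description of the language, and all function names,
are our reading rather than statements from the text.

```python
def sigma_series(max_degree):
    """Coefficients of sum_{n>=1} z^n / (1 - z^n)^2 up to z^max_degree."""
    coeffs = [0] * (max_degree + 1)
    for n in range(1, max_degree + 1):
        # Row n: z^n / (1 - z^n)^2 = z^n + 2 z^(2n) + 3 z^(3n) + ...
        for k in range(1, max_degree // n + 1):
            coeffs[k * n] += k
    return coeffs

def divisor_sum(n):
    """sigma(n): sum of the positive divisors of n."""
    return sum(d for d in range(1, n + 1) if n % d == 0)

def word_counts(max_length):
    """Count distinct words a^n (b^n)^j (c^n)^k by length (assumed language)."""
    words = set()
    for n in range(1, max_length + 1):
        for j in range(0, max_length // n):
            for k in range(0, max_length // n):
                if n * (1 + j + k) <= max_length:
                    words.add("a" * n + "b" * (n * j) + "c" * (n * k))
    counts = [0] * (max_length + 1)
    for w in words:
        counts[len(w)] += 1
    return counts

N = 24
series = sigma_series(N)
assert all(series[n] == divisor_sum(n) for n in range(1, N + 1))
assert word_counts(N)[1:] == series[1:]
```

Using a distinct terminal c for the second block is what separates the two
derivations of ab noted above: the words ab and ac are now different.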

We conclude this section with the example of Bridson and Gilman [2] alluded 
to in our introduction. They derive words that encode the cutting sequence of the 
line segment from the origin to each integer lattice point in the Euclidean plane as 
the segment crosses the horizontal (h) and vertical (v) lines that join lattice points. 
Such sequences are made unique by declaring that as a segment passes through a 
lattice point, the corresponding cutting sequence adds hv. For instance, the cutting 
sequence for the segment ending at (2, 4) is the word hhvhhv. 

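Example 14. The grammar is given by

\[ \mathbf{S} \to \mathbf{T}_{\$} \qquad \mathbf{T} \to \mathbf{T}_q \mid \mathbf{T}_r \mid \mathbf{U}_q \qquad \mathbf{U} \to \mathbf{V}\mathbf{U} \mid \mathbf{V} \qquad \mathbf{V}_q \to \mathbf{H}\mathbf{V} \qquad \mathbf{V}_r \to \mathbf{V} \]
\[ \mathbf{V}_{\$} \to v \qquad \mathbf{H}_q \to \mathbf{H} \qquad \mathbf{H}_r \to \mathbf{V}\mathbf{H} \qquad \mathbf{H}_{\$} \to h . \]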

Attempting to solve the grammar equations immediately runs into difficulty. The
valid sentential forms \(\mathbf{V}_{qqrq\$}\) and \(\mathbf{V}_{qqqr\$}\) produce words of length eight and seven
respectively, which disallows the idea of simplification via Parikh vectors. Indeed,
a brute-force numerical attempt using the DSV method has exponential time
complexity in the length of index strings.

We circumvent this problem by introducing two commuting formal variables
x, y. Define \(L(x,y) = \sum_{i,j>0} L_{i,j}\, x^i y^j\) where \(L_{i,j}\) counts the number of cutting
sequence words that have i many occurrences of v and j many h's. The coefficients
\(L_{i,j}\) comprise a frequency distribution on the first quadrant of the integer lattice.
This distribution contains more information than the one dimensional generating
function \(S(z)\). On the other hand, we can recover \(S(z) = \sum_{n\ge 1} s_n z^n\) by the formula
\(s_n = \sum_{i+j=n} L_{i,j}\).
To compute the \(L_{i,j}\), let us simplify the grammar by ignoring the context-free
copying productions \(\mathbf{U} \to \mathbf{V}\mathbf{U} \mid \mathbf{V}\), change the loading productions to
\(\mathbf{T} \to \mathbf{T}_q \mid \mathbf{T}_r \mid \mathbf{V}_q\), and begin by unloading sentential forms \(\mathbf{V}_{q\sigma}\), where \(\sigma \in (q|r)^*\). As
per the proof of Lemma 3 we define a step as the application of the leftmost stack
symbol to all non-terminals in a sentential form. If we start with an index string
attached to \(\mathbf{V}\) of length l, then after n steps each non-terminal will have the same
index string of length l - n. For instance if we start with \(\mathbf{V}_{qqr\sigma}\) then the first
step is \(\mathbf{V}_{qqr\sigma} \to \mathbf{H}_{qr\sigma}\mathbf{V}_{qr\sigma}\). The second step comprises two productions and ends
with \(\mathbf{H}_{r\sigma}\mathbf{H}_{r\sigma}\mathbf{V}_{r\sigma}\) while the third step unloads the r from each index and results in
\(\mathbf{V}_\sigma\mathbf{H}_\sigma\mathbf{V}_\sigma\mathbf{H}_\sigma\mathbf{V}_\sigma\).
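
The step-by-step unloading can be simulated with plain string rewriting. In
the sketch below the dictionary `RULES` and the function names are ours; the
common index string is left implicit because, after each step, every
non-terminal carries the same remaining stack.

```python
# Productions of the simplified grammar, keyed by (non-terminal, index symbol).
RULES = {
    ("V", "q"): "HV", ("V", "r"): "V",
    ("H", "q"): "H",  ("H", "r"): "VH",
}

def unload(index_string):
    """Unload V_{index_string $} one step at a time.

    Returns the list of sentential forms (non-terminals only), starting
    with the initial 'V'.
    """
    form = "V"
    forms = [form]
    for symbol in index_string:
        # One step: apply the leftmost stack symbol to every non-terminal.
        form = "".join(RULES[(nonterminal, symbol)] for nonterminal in form)
        forms.append(form)
    return forms

def cutting_word(index_string):
    """Finish with V_$ -> v and H_$ -> h once the stack is empty."""
    final_form = unload(index_string)[-1]
    return final_form.replace("V", "v").replace("H", "h")

print(unload("qqr"))          # ['V', 'HV', 'HHV', 'VHVHV']
print(cutting_word("qqr"))    # vhvhv
print(len(cutting_word("qqrq")), len(cutting_word("qqqr")))  # 8 7
```

The last line reproduces the lengths eight and seven quoted for the Parikh
equivalent index strings qqrq and qqqr.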

Let \(x_i\) be the number of \(\mathbf{V}\) non-terminals after performing step i, and let \(y_i\) be
the number of \(\mathbf{H}\) non-terminals after performing step i. For a step unloading q
we observe the recursions \(y_i = y_{i-1} + x_{i-1}\) and \(x_i = x_{i-1}\). Likewise for r we see
recursions \(y_i = y_{i-1}\) and \(x_i = y_{i-1} + x_{i-1}\). Our simplified grammar always begins
the unloading stage with \(\mathbf{V}_{q\sigma}\) and thus we obtain the initial condition \(x_1 = 1 = y_1\)
regardless of \(\sigma\). This condition, along with our recursions above, implies that for each
i, the pair of integers \((x_i, y_i)\) is relatively prime.
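
The recursions are easy to iterate without building any words. The sketch
below uses the bookkeeping convention (ours) that the state before step 1 is
(x, y) = (1, 0), a single V, so that the first q step yields the initial
condition (1, 1).

```python
from itertools import product
from math import gcd

def step_counts(index_string):
    """Return [(x_1, y_1), (x_2, y_2), ...] for the given index string."""
    x, y = 1, 0                    # one V and no H before the first step
    history = []
    for symbol in index_string:
        if symbol == "q":
            y = y + x              # y_i = y_{i-1} + x_{i-1}, x_i = x_{i-1}
        elif symbol == "r":
            x = x + y              # x_i = y_{i-1} + x_{i-1}, y_i = y_{i-1}
        else:
            raise ValueError("index symbols must be 'q' or 'r'")
        history.append((x, y))
    return history

def all_pairs_coprime(max_tail_length):
    """Check gcd(x_i, y_i) = 1 for every q-sigma with |sigma| < max_tail_length."""
    for n in range(max_tail_length):
        for tail in product("qr", repeat=n):
            for x, y in step_counts("q" + "".join(tail)):
                if gcd(x, y) != 1:
                    return False
    return True

print(step_counts("qqr"))      # [(1, 1), (1, 2), (3, 2)]
print(all_pairs_coprime(10))   # True
```

Each step adds one coordinate to the other, which leaves the greatest common
divisor unchanged; this is the observation behind the coprimality claim.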

Suppose that n is the last step needed to produce a cutting sequence word from
\(\mathbf{V}_{q\sigma}\). Identify each pair \((x_n, y_n)\) with the corresponding point in the integer lattice,
so \(x_n\) is the total number of vertical lines crossed in the cutting sequence and \(y_n\) is
the number of horizontal lines crossed. (Note that \(x_n\) and \(y_n\) depend on \(\sigma\) as well
as n.)

We saw above that \(\mathbf{V}_{qqr\$} \overset{*}{\to} \mathbf{V}_{\$}\mathbf{H}_{\$}\mathbf{V}_{\$}\mathbf{H}_{\$}\mathbf{V}_{\$} \overset{*}{\to} vhvhv\) which is represented by
(3,2). The pair (2,3) is realized from \(\mathbf{V}_{qrq\$} \overset{*}{\to} hvhhv\). Indeed, the symmetry of
the grammar implies that every generated pair \((x_n, y_n)\) has a generated mirror
image \((y_n, x_n)\) obtained by transposing each q and r in the index substring \(\sigma\)
attached to the initial unloading symbol \(\mathbf{V}_{q\sigma}\).

We claim that every relatively prime pair of positive integers is realized as
\((x_n, y_n)\) for some cutting sequence word generated by our simplified grammar. We
show this by running in reverse the algorithm that generates cutting sequences. Let
\((i,j) \ne (1,1)\) denote a coprime pair and suppose by induction that all other
relatively prime pairs (k, l) are the result of unique cutting sequence words whenever
(\(k < i\) and \(l \le j\)) or (\(k \le i\) and \(l < j\)), i.e. whenever the point (k, l) is strictly
below or left of the point (i, j). If \(i < j\) then the letter q was applied at the most
recent step with the previous pair being defined by \((i, j - i)\). On the other hand,
if \(i > j\) then the rightmost letter is r and we define the previous pair as \((i - j, j)\). In
either case the new pair of coordinates remains coprime and lies in the induction
hypothesis zone. Note that this is just the Euclidean algorithm for greatest common
divisor and always terminates at (1,1) when the starting pair is coprime.
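
The reverse procedure is the subtractive Euclidean algorithm, sketched below.
The helper names are ours; `pair_for` replays the forward recursions so that
the round trip can be checked.

```python
from math import gcd

def index_string_for(i, j):
    """Recover the index string (without '$') that realizes the coprime pair (i, j)."""
    if i < 1 or j < 1 or gcd(i, j) != 1:
        raise ValueError("expected a coprime pair of positive integers")
    letters = []
    while (i, j) != (1, 1):
        if i < j:
            letters.append("q")    # last step unloaded q: previous pair (i, j - i)
            j -= i
        else:
            letters.append("r")    # last step unloaded r: previous pair (i - j, j)
            i -= j
    letters.append("q")            # the initial q step produces (1, 1)
    return "".join(reversed(letters))

def pair_for(index_string):
    """Forward recursions: numbers of V's and H's after unloading the string."""
    x, y = 1, 0
    for symbol in index_string:
        if symbol == "q":
            y += x
        else:
            x += y
    return (x, y)

print(index_string_for(3, 2), index_string_for(2, 3))   # qqr qrq

coprime_pairs = [(i, j) for i in range(1, 20) for j in range(1, 20) if gcd(i, j) == 1]
strings = [index_string_for(i, j) for i, j in coprime_pairs]
assert all(pair_for(s) == p for s, p in zip(strings, coprime_pairs))
assert len(set(strings)) == len(strings)   # distinct pairs give distinct strings
```

For a coprime pair other than (1,1) the two coordinates are never equal, so
exactly one of the two branches applies at every stage.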

Consequently the relatively prime pair (i, j) is uniquely realized as \((x_n, y_n)\) from
some cutting sequence word w generated by our simplified grammar. Furthermore
\(\mathbf{V}_{q\sigma} \overset{*}{\to} w\) satisfies \(|q\sigma| = n\) which is the correct number of steps taken.



We apply our argument above to compute the two dimensional generating
function \(L(x,y)\). For our simplified grammar we have \(L_{i,j} = 1\) if the pair (i, j) is
relatively prime and \(L_{i,j}\) vanishes otherwise. Equivalently, \(L_{i,j} = 1\) if and only
if i and i + j are coprime. To recover \(S(z) = \sum_{n\ge 2} V_n z^n\) from \(L(x,y)\) we set
\(V_n = \sum_{i+j=n} L_{i,j}\), i.e. we sum along slope -1 diagonal lines in quadrant one.
Thus \(V_n = \varphi(n)\) where \(\varphi\) is Euler's totient function that counts the number of
integers \(1 \le i \le n\) that are coprime to n. Summary: we have successfully circumvented
the exponential time complexity of computing \(S(z) = \sum_{\sigma \in (q|r)^*\$} V_{q\sigma}(z)\) and found
that

\[ S(z) = \sum_{n\ge 2} \varphi(n) z^n . \]

Recovering the original grammar by restoring the \(\mathbf{U}\) productions allows for the
construction of repeated cutting sequences \(w^t\), w being the word associated to a
coprime lattice point (i, j). This serves to add the lattice points (ti, tj). Here we
may assume i and j are relatively prime and \(t \ge 2\) which makes these additions
unique. (In fact, if (ti, tj) = (sk, sl) for another coprime pair (k, l), then both s
and t are the greatest common divisor of the ordered pair and hence s = t.) The
full grammar is in bijective correspondence with the integer lattice strictly inside
quadrant one. Words represent geodesics in the taxicab metric. Simple observation
shows that the full growth series is represented by the rational function

\[ \sum_{n\ge 2} (n-1) z^n = \frac{z^2}{(1-z)^2} \]

and as a byproduct we have re-derived Euler's identity

\[ n - 1 = \sum_{d \mid n,\ d > 1} \varphi(d) \quad (!) \]
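
Both counting statements can be confirmed numerically. The sketch below uses
naive definitions (names are ours) and is meant only as a sanity check for
small n.

```python
from math import gcd

def diagonal_count(n):
    """V_n: coprime pairs (i, j) with i, j >= 1 and i + j = n."""
    return sum(1 for i in range(1, n) if gcd(i, n - i) == 1)

def totient(n):
    """Euler's phi: integers 1 <= i <= n coprime to n."""
    return sum(1 for i in range(1, n + 1) if gcd(i, n) == 1)

def full_count(n):
    """Lattice points (ti, tj) with ti + tj = n, grouped by d = i + j = n / t."""
    return sum(totient(d) for d in range(2, n + 1) if n % d == 0)

for n in range(2, 200):
    assert diagonal_count(n) == totient(n)   # simplified grammar: phi(n) words
    assert full_count(n) == n - 1            # full grammar: n - 1 words
```

The second assertion is the identity above: the n - 1 lattice points on the
diagonal i + j = n are partitioned according to their greatest common divisor.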

5. Ambiguity

We begin with the first published example of an inherently ambiguous
context-free language [14] [19]. It has an unambiguous indexed grammar.

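Example 15. Define \(\mathcal{L}_{amb} = \{a^i b^j a^k b^l : i,j,k,l \ge 1;\ i = k \text{ or } j = l\}\). The idea
is to divide the language into a disjoint union of the three sub-languages

\[ \mathcal{L}_X = \{a^i b^j a^i b^l : j < l\} \qquad \mathcal{L}_Y = \{a^i b^j a^i b^l : l < j\} \qquad \mathcal{L}_Z = \{a^i b^j a^k b^j\} \]

with no restrictions on i, k other than that all exponents are at least one. An
indexed grammar is

\[ \mathbf{S} \to \mathbf{T}_{g\$} \qquad \mathbf{T} \to \mathbf{T}_g \mid \mathbf{U}_f \mid \mathbf{Z} \qquad \mathbf{U} \to \mathbf{U}_f \mid \mathbf{X} \mid \mathbf{Y} \]
\[ \mathbf{X} \to \mathbf{A}\mathbf{B}\mathbf{A}\mathbf{C} \qquad \mathbf{Y} \to \mathbf{A}\mathbf{C}\mathbf{A}\mathbf{B} \qquad \mathbf{Z} \to \mathbf{D}\mathbf{B}\mathbf{E}\mathbf{B} \]
\[ \mathbf{A}_f \to a\mathbf{A} \qquad \mathbf{A}_g \to \mathbf{A} \qquad \mathbf{A}_{\$} \to \varepsilon \qquad \mathbf{B}_f \to \mathbf{B} \qquad \mathbf{B}_g \to b\mathbf{B} \qquad \mathbf{B}_{\$} \to \varepsilon \]
\[ \mathbf{C}_f \to \mathbf{C} \qquad \mathbf{C}_g \to b\mathbf{C} \qquad \mathbf{C}_{\$} \to b\mathbf{C}_{\$} \mid b \qquad \mathbf{D} \to a\mathbf{D} \mid a \qquad \mathbf{E} \to a\mathbf{E} \mid a \]

The reader is invited to draw the typical parse tree and verify that the growth of
this language is \(\frac{z^4(1+3z)}{(1-z)(1-z^2)^2}\).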
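
The growth series can be cross-checked by brute force. The closed form used
below, \(z^4(1+3z)/((1-z)(1-z^2)^2)\), is obtained by adding the series of the
three sub-languages; the sketch compares its coefficients with a direct count
of exponent tuples. All function names are ours.

```python
def brute_force_counts(max_len):
    """Count words a^i b^j a^k b^l (all exponents >= 1, i = k or j = l) by length."""
    counts = [0] * (max_len + 1)
    for i in range(1, max_len + 1):
        for j in range(1, max_len + 1 - i):
            for k in range(1, max_len + 1 - i - j):
                for l in range(1, max_len + 1 - i - j - k):
                    if i == k or j == l:
                        counts[i + j + k + l] += 1
    return counts

def multiply(p, q, max_len):
    """Product of two power series given as coefficient lists, truncated."""
    out = [0] * (max_len + 1)
    for a, pa in enumerate(p):
        if pa == 0:
            continue
        for b, qb in enumerate(q):
            if a + b > max_len:
                break
            out[a + b] += pa * qb
    return out

def closed_form_counts(max_len):
    """Coefficients of z^4 (1 + 3z) / ((1 - z)(1 - z^2)^2)."""
    numerator = [0, 0, 0, 0, 1, 3]                        # z^4 + 3 z^5
    geometric = [1] * (max_len + 1)                       # 1 / (1 - z)
    squared = [m // 2 + 1 if m % 2 == 0 else 0            # 1 / (1 - z^2)^2
               for m in range(max_len + 1)]
    return multiply(multiply(numerator, geometric, max_len), squared, max_len)

assert brute_force_counts(24) == closed_form_counts(24)
print(closed_form_counts(8)[4:])   # [1, 4, 6, 12, 15]
```

Distinct exponent tuples give distinct words, so counting tuples is the same
as counting words of the language.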

Several other inherently ambiguous context-free languages can be generated
unambiguously by an indexed grammar.

Exercise 16. Another early example of an inherently ambiguous context-free
language is featured in [4]: \(\mathcal{L} = \{a^n b^m c^p : m = n > 0 \text{ or } m = p > 0\}\). Write an
unambiguous indexed grammar for it. Again, split the language into the disjoint union
of three types of words and build a grammar for each. The types are \(a^n b^n c^p\) with
\(0 \le p < n\), \(a^n b^n c^p\) with \(0 < n < p\), and \(a^n b^p c^p\) with \(p > 0\).

Our examples beg the question: are there inherently ambiguous indexed
languages? Consider Crestin's language of palindrome pairs defined by
\(\mathcal{L}_{Crestin} = \{vw : v, w \in (a|b)^*,\ v = v^R,\ w = w^R\}\). It is a "worst case" example of an
inherently ambiguous context-free language (see [7] and its references). We conjecture
that \(\mathcal{L}_{Crestin}\) remains inherently ambiguous as an indexed language. What about
inherently ambiguous languages that are not context-free?

Consider the composite numbers written in unary as per Example 13. What
would an unambiguous grammar for \(\mathcal{L}_{comp}\) look like? We would need a unique
factorization for each composite c. Since the arithmetic that indexed grammars
can simulate on unary output is discrete math (like addition and multiplication,
no division or roots, etc), we need the Fundamental Theorem of Arithmetic. In
fact, suppose there is a different unique factorization scheme for the composites
that doesn't involve a certain prime p. Then the composite \(c_2 = p^2\) has only the
factorization \(1 \cdot c_2\), and similarly \(c_3 = p^3\) has unique factorization \(1 \cdot c_3\) since \(p \cdot c_2\) is
disallowed. But then \(p^6 = c_2 \cdot c_2 \cdot c_2 = c_3 \cdot c_3\) has no unique factorization. Therefore
all primes p are needed for any unique factorization of the set of composites. Adding
any other building blocks to the set of primes ruins unique factorization.

Suppose we have an unambiguous indexed grammar for \(\mathcal{L}_{comp}\). It would be able
to generate \(0^{p^k}\) for any prime p and all k > 1. This requires a copying mechanism
(in the manner of \(\mathbf{R}\) in Examples 12 and 13) and an encoding of p into an index string
(recall our slogan "encode first, then copy"). In other words, our supposed grammar
for \(\mathcal{L}_{comp}\) must be able to first produce its complement \(\mathcal{L}_{prime}\) and encode these
primes into index strings. However, [18] show that the set of index strings associated
to a non-terminal in an indexed grammar is necessarily a regular language. We
find it highly unlikely that an indexed grammar can decode all the primes from a
regular set of index strings. We conjecture that \(\mathcal{L}_{comp} = \{0^c : c \text{ is composite}\}\) is
inherently ambiguous as an indexed language.

Recall that a word is primitive if it is not a power of another word. In the copious 
literature on the subject it is customary to let Q denote the language of primitive 
words over a two letter alphabet. It is known that Q is not unambiguously
context-free (see [20] [21], who exploits the original Chomsky-Schützenberger theorem listed
in Section 2 above). It is a widely believed conjecture that Q is not context-free at
all (see p|). 

\(\mathcal{L}' = \{w^k : w \in (a|b)^*,\ k > 1\}\) defines the complement of Q with respect to
the free monoid \((a|b)^*\). It is not difficult to construct an ambiguous balanced
grammar for \(\mathcal{L}'\) (a simple modification of Example 11 will suffice). What about an
unambiguous grammar? Recall from [17] that \(w_1^m = w_2^n\) implies that each \(w_i\) is a
power of a common word v. Thus to avoid ambiguity, each building block w used
to construct \(\mathcal{L}'\) needs to be primitive. This means we must not only be able to
recreate Q in order to generate \(\mathcal{L}'\) unambiguously, we must be able to encode each
word \(w \in Q\) as a string of index symbols, as per the language of composites. Again
we find this highly unlikely and we conjecture that \(\mathcal{L}' = \{w^k : w \in (a|b)^*,\ k > 1\}\)
is inherently ambiguous as an indexed language.

6. Open questions 

We observe that the generating function S{z) of an indexed language is an infinite 
sum (or multiple sums) of a family of functions related by a finite depth recursion 
(or products/sums of the same). Into what class do the generating functions of 
indexed languages fit? 

Is Crestin's language inherently ambiguous as an indexed language? What about 
the composite numbers in unary or the complement of the primitive words? 

We end where we began with the non-context-free language \(\{a^n b^n c^n : n > 0\}\).
It has a context-sensitive grammar

\[ S \to abc \mid aBSc \qquad Ba \to aB \qquad Bb \to bb \]

for which the original DSV method works perfectly. The method fails for
several other grammars generating this same language. What are the necessary and
sufficient conditions to extend the method to generic context-sensitive grammars?
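
As a sanity check on this grammar, the sketch below explores all sentential
forms up to a length bound. It assumes the last rule is read as Bb -> bb, the
orientation under which a B, which is created to the left of the b's, can be
consumed; the rule list and function name are ours.

```python
from collections import deque

# Rewriting rules (left-hand side, right-hand side); none of them shortens a form.
RULES = [("S", "abc"), ("S", "aBSc"), ("Ba", "aB"), ("Bb", "bb")]

def generated_words(max_len):
    """Terminal words of length <= max_len derivable from S.

    Because no rule is contracting, sentential forms longer than max_len
    can be discarded without losing any word within the bound.
    """
    seen = {"S"}
    queue = deque(["S"])
    words = set()
    while queue:
        form = queue.popleft()
        if form.islower():             # only terminals remain
            words.add(form)
            continue
        for lhs, rhs in RULES:
            start = form.find(lhs)
            while start != -1:         # rewrite every occurrence of lhs
                new_form = form[:start] + rhs + form[start + len(lhs):]
                if len(new_form) <= max_len and new_form not in seen:
                    seen.add(new_form)
                    queue.append(new_form)
                start = form.find(lhs, start + 1)
    return words

expected = {"a" * n + "b" * n + "c" * n for n in range(1, 5)}
assert generated_words(12) == expected
```

The search finds exactly one word of each length 3n, consistent with the
growth series \(z^3/(1-z^3)\) of the language.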